Abstract

A heuristic diagram of the evolution of the standard genetic code is presented. It incorporates, in a way
that resembles the energy levels of an atom, the physical notion of broken symmetry, and it is consistent
with original ideas by Crick on the origin and evolution of the code as well as with the chronological order of
appearance of the amino acids during evolution, as inferred from work that combines known experimental
results with theoretical speculations. Suggested by the diagram, we propose a mathematical representation
of the present-day code based on Hamilton quaternions. The central object in the description is a
codon function that assigns to each amino acid an integer quaternion in such a way that the observed code
degeneration is preserved. We emphasize the advantages of a quaternionic representation of amino acids,
taking as an example the folding of proteins. With this aim we propose an algorithm to go from the sequence
of quaternions to the three-dimensional structure of the protein, which can be compared with the corresponding
experimental structure stored in the Protein Data Bank. In our view, the mathematical representation of the
genetic code in terms of quaternions deserves to be taken into account because it not only describes most of
the known properties of the genetic code but also opens new perspectives that derive mainly from the
close relationship between quaternions and rotations.
1. Introduction

The standard genetic code [1], that is, the correspondence between the sequence of nucleotide bases of mRNA
molecules and the sequence of amino acids in ribosomal protein synthesis as it occurs in the cells of most
animals and plants, is nowadays fairly well known. The mRNA bases belong to the set {A, C, G, U},
where A stands for adenine, C for cytosine, G for guanine and U for uracil. Each non-overlapping triplet of
consecutive bases (codon) encodes just one of the 20 standard amino acids (see Appendix A) or a stop signal.
In principle, there is no separation of any kind between adjacent codons in the sequence. Of the
4^3 = 64 possible different codons, 61 translate into amino acids and the remaining three determine a stop
signal. We are then speaking about a code of four letters that can form 64 words of three letters each. The
words translate into amino acids or the stop signal.

The mechanism that performs this translation involves a very sophisticated molecular machinery which
is not completely known yet. However, Crick's adaptor hypothesis [2] and further refinements [3] [4] are,
in general, widely accepted as accurate enough to describe, at the molecular level, the complex translation
procedure in most cases. The currently accepted picture is that tRNA molecules act as intermediaries
(adaptors) between the template (mRNA) and the amino acids that will form the protein. The amino
acid to be incorporated into the protein chain is covalently bonded to the 3' end of the tRNA (forming an
aminoacyl-tRNA complex) while, in another part of the tRNA chain, a triplet of nucleotide bases
(anticodon) interacts specifically with the codon of the mRNA template that encodes the amino acid in
question. The bases of the anticodon are just the complements of the corresponding codon bases
(read in the direction 5' → 3') and the interactions manifest as hydrogen bonds between complementary
bases.

Setting aside for the moment the molecular details of the translation and restricting ourselves to the
correspondence codons → amino acids itself, we reproduce in Figure 1 a classical presentation of the
standard genetic code. The structure of the code is evident. Each codon encodes just one amino acid or (in
the case of the codons UAA, UAG and UGA) the stop signal. The code is degenerate in the sense that,
except for the amino acids methionine (met) and tryptophan (trp), which are each encoded by a single codon,
all the other amino acids are encoded by two or more codons.
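The counting above can be checked with a short script. This is a minimal sketch, not part of the original work: the names `CODE`, `translate` and `degeneracy` are illustrative, the table is the textbook standard code written with the one-letter convention (`*` marks the stop signal), and reading is assumed to start at the first base of the string.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids ('*' = stop) for the 64 codons, ordered by
# first, second and third base, each running over U, C, A, G.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Map every codon (e.g. "AUG") to its amino acid or stop signal.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Read non-overlapping triplets from the first base until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Number of codons per amino acid (degeneracy of the code).
degeneracy = Counter(CODE.values())

print(len(CODE))                    # number of codons
print(sorted(degeneracy.items(), key=lambda kv: kv[1]))
print(translate("AUGGCUUGGUAA"))    # Met, Ala, Trp, then stop
```

Grouping the codons by amino acid in this way makes the pattern of one, two, three, four and six codons per amino acid explicit.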
Figure 1: Textbook picture of the standard genetic code. The three-letter convention for the amino acids is used (see Appendix
A) and the third base in the codons is marked in bold. The codons are read in the direction 5' → 3'. The codon
AUG, besides encoding the amino acid methionine (met), also determines the starting point within the mRNA sequence for
protein synthesis.


One interesting related question that has received some attention is the origin and evolution of the genetic
code. The proposals in this direction are obviously rather speculative. However, Crick's scenario [10],
according to which originally only a few amino acids were coded by most of the possible three-base codons
and, in subsequent steps, some of those codons replaced the amino acid they coded with a new
one until eventually the code became frozen in its present form, seems reasonable and very attractive. In
particular, the idea of an increasing number of coded amino acids can be correlated with studies
on the evolution of amino acid abundance [11] [12].

A further step in relation to the genetic code comprises several efforts to build mathematical
models that describe the present structure of the code and how it evolved to reach this
state [13]-[15]. The main mathematical tools are tensor algebras and group theory. In particular, in Ref.
[13] the authors use the physical concept of broken symmetry to find a mathematical group with a
16-dimensional representation (the highly degenerate primitive code) which can be written as the product of
simpler groups that describe the pattern of redundancies observed in Figure 1. The approach gives a very
elegant physical explanation of the code degeneration. However, perhaps because it applies
a relatively complicated mathematical tool to a subject dominated by researchers trained mainly
in disciplines other than Mathematics and Physics, the work has been taken just as a valuable exercise in
classification [16] [17].

In this work we also propose a mathematical description of the genetic code, but one based on a tool that,
in our judgement, is very friendly and, at the same time, powerful enough to open new perspectives beyond
simply giving a representation of the code structure. We are talking about the Hamilton quaternions [18] [19].
These mathematical objects are a sort of generalization of the complex numbers and obey an algebra
similar in many aspects to theirs, but with the very important (for our purposes) property that the product
is, in general, non-commutative (see Appendix B). In addition, quaternions are ideal for representing
rotations, with important advantages over the classical matrix representation. This fact has of course already
been recognized by bioinformaticians in writing routines involving the tertiary structure of proteins. We
must mention that Petoukhov has also applied quaternions to describe the genetic code, but from a very
different point of view [20].

Our journey starts in the next Section with a diagram for the evolution of the genetic code
that incorporates the concept of broken symmetry in a way that resembles the energy levels of an atom.
Our actual interest is in the present form of the code; however, the evolution diagram gives a picture of
the correspondence base triplets → amino acids that will help us with the mathematical representation of
this correspondence by means of quaternions. Moreover, despite the high degree of speculation that exists
in any model for the origin and evolution of the genetic code, our diagram admits an interpretation
which is consistent with the above-mentioned ideas by Crick on the subject [10]. Thus, inspired by this
diagram, in Section 3 we represent the relationship between codons and amino acids by
using quaternions. First we assign an integer quaternion (Lipschitz integer) to each of the four nucleotide
bases and then, as suggested by the diagram structure, we consider a codon function that assigns
a quaternion to each of the amino acids. The explicit form of this function involves
simple quaternionic operations (products and sums) that automatically account for the degeneration of
amino acids encoded by four, three or two codons. In addition to the quaternions assigned to
the four bases, it includes a number of extra quaternions, related to the splitting of the "atomic levels"
caused by symmetry breaking during evolution, which are in principle undetermined. These extra quaternions
are fixed by demanding that the degeneration of amino acids encoded by more than four (concretely
six) codons also be reproduced. For this scheme to work in practice we need to give the four
quaternions for the bases explicitly. Of the infinitely many options, the one we choose clearly has a Pythagorean
flavor: we consider a subset of four quaternions from the complete set of eight prime integer quaternions
with norm 7. The subset we take does not contain pairs of conjugate quaternions, four being the maximum
cardinality for a subset with this property [21]. Once a quaternion of this subset has been assigned to each
of the four nucleotide bases, the quaternion corresponding to each amino acid is directly determined by the
above-mentioned function. This completes the quaternionic description of the genetic code.

To highlight the potential of the quaternionic representation of amino acids for opening new perspectives
beyond the description of the genetic code degeneration, we turn to another fundamental question: the
protein folding problem, that is, the determination of the native tertiary structure of a protein from the
knowledge of its amino acid sequence (primary structure) [22] [23]. The protein folding problem is per se a
phenomenal task that in some sense can be considered experimentally solved through X-ray diffraction,
Nuclear Magnetic Resonance and other techniques. Theoretically, however, the problem remains unsolved,
and a great deal of work has been done by many researchers since the middle of the past century to develop
a computational procedure that predicts the tertiary structure of a given protein from its amino
acid sequence. Here we do not review the many methods proposed to attack the question and simply
give our own, perhaps rather heuristic, approach in order to show the advantages of associating amino acids with
quaternions. This is done in Section 4, where we show the procedure that we have designed
to go from the amino acid quaternions to the coordinates of the backbone alpha-carbon atoms of a protein
whose spatial structure we assume to be the native one for the given amino acid sequence. These coordinates
can be compared with those obtained experimentally and stored in the Protein Data Bank (PDB) [24]. The
procedure involves a set of real quaternions associated with the order of the amino acids in the chain, so
that each amino acid in a protein is represented by an integer quaternion (type quaternion) and a real
quaternion (order quaternion). If these order quaternions were the same for all proteins, the protein
folding problem would be solved. In this work we limit ourselves to showing how the type and order quaternions
can be used to transform the primary structure of a given protein into its spatial configuration. The problem
of obtaining a set of order quaternions adequate for all proteins (if it exists), that is, the possibility
of solving the protein folding problem, is left for future work.

Two Appendices, one with the one- and three-letter conventions for identifying the 20 standard amino
acids and another with the main properties of quaternions, are given at the end for completeness.

2. A diagram for the evolution of the genetic code 

In Figure 2 we show the diagram that we propose to account for the evolution of the genetic code.
It is mainly inspired by pioneering ideas by Crick [10] and by the physical concept of broken symmetry,
first applied to the genetic code by Hornos and Hornos [13].

According to Crick, if the genetic code is at present a triplet code, in the sense that the reading
mechanism moves along three bases at each step, then it must always have been a triplet code, since otherwise
a loss of Darwinian fitness would occur. Thus we assume that the codons were always formed by three bases
of the set {A, C, G, U}. We must mention that Crick also analyzed the plausibility of primitive nucleic
acids constituted by just two bases. However, even if this were the case, the passage to a four-base
system had to occur at some moment of evolution without substantially altering the message carried
by the old two-base chain (Principle of continuity), so we can take the four-base alphabet as having been
available from a given moment at the origins of the code. Therefore we accept that from the beginning
codons are triplets of bases chosen from the set {A, C, G, U}. Moreover, we consider that, in the first
evolution steps, only the second base of the codon was effective in encoding amino acids. Accordingly, only
four amino acids could be encoded, each one by one of the four bases C, G, U and A, independently of
the first and third bases. In the diagram this fact is denoted by a rectangle containing the four
letters. This is consistent with Crick's suggestion that only a few amino acids were coded at the beginning.
According to the diagram, C would encode alanine (A); G, glycine (G); U, valine (V); and A, aspartic acid
(D), whatever the first and third bases are. It is worth noting here that the four amino acids that we assume
were the first to be encoded are the first four in the Trifonov [12] consensus temporal order scale for
the appearance of the amino acids (column of natural numbers in Figure 2). The four amino acids A, G,
V and D were also the first four that appeared under simulation of primitive Earth conditions in Miller's
experiments.

Figure 2: Authors' proposal for the evolution of the genetic code. The one-letter convention for amino acids is used (see Appendix
A). The direction of temporal evolution is from left to right. Rectangles with two or more bases imply degeneration with
respect to those bases. The broken lines link different sets of codons that encode the same amino acid in the case of sixfold
degeneration. Arrows and plain lines indicate, respectively, which codons continue to encode the same amino acid and which
start to encode a new one after the symmetry is broken (see text). The natural numbers at the right side of the diagram
give the temporal order of the amino acids in the Trifonov consensus scale [12].

As the left part of the diagram shows, our version of the primitive code is highly degenerate: in principle
each of the four amino acids A, G, V and D could be encoded by 4^2 = 16 codons. Physically, the idea of
degeneration is closely related to the concept of symmetry, and a very illustrative way to think about
these concepts is by analogy with the energy levels of an atom. In our case we would have four
levels, each indexed by the letter corresponding to the second codon base, that is, C, G, U and A (main
quantum number). We assume that, as the code evolves, the symmetry that makes the amino
acid encoding independent of the first base of the codon disappears for some reason. The reason could
be that with time the recognition mechanism becomes precise enough to differentiate between two codons
with distinct first bases. Because of this symmetry breaking, part of the degeneration also disappears.
In the diagram each of the four initial levels splits into four new levels, one for each of the possible bases
(C, G, U and A) at the first place of the codon (secondary quantum number). Now we have a total of
16 levels, each indexed by two letters (the first and second bases of the codon). Each level is fourfold
degenerate in the third base of the codon. One of the new levels continues to encode the same amino acid as
before the splitting, whereas the other three encode a new amino acid each. We indicate with an arrow the
four groups of codons that conserve the amino acid and with a plain line those that replace the amino
acid with a new one. Note that the codons that continue to encode the same amino acid are those whose first
base is guanine (G). This is consistent with the above-mentioned temporal order of appearance and with the
present-day assignment of amino acids in the case of fourfold degeneration, as shown in Figure 1.
In this way 9 new amino acids (which together with the old four make 13) and the stop signal are coded. Note also that
we assume that the amino acids serine (S) and leucine (L) were at that moment encoded by two groups of
codons each: S by UC (third base arbitrary) and AG (third base arbitrary), and L by CU (third base arbitrary)
and UU (third base arbitrary).
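The first symmetry breaking described above can be compared with the present-day code. The following sketch is only illustrative (the names `PRIMITIVE`, `STANDARD` and `levels_by_first_two_bases` are not from the original work): it groups the 64 codons of the textbook standard code into the 16 levels indexed by first and second base, and checks that the levels whose first base is G still contain the amino acid assigned to their second base in the assumed primitive code.

```python
from itertools import product

BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Primitive code assumed in the text: only the second base matters.
PRIMITIVE = {"C": "A", "G": "G", "U": "V", "A": "D"}


def levels_by_first_two_bases():
    """Group present-day codons into the 16 levels (first base, second base)."""
    levels = {}
    for codon, aa in STANDARD.items():
        levels.setdefault(codon[:2], set()).add(aa)
    return levels


levels = levels_by_first_two_bases()
for second, old_aa in PRIMITIVE.items():
    present = sorted(levels["G" + second])
    print(f"G{second}N: primitive {old_aa}, present-day {present}")

# Levels that never split again (fourfold degenerate in the third base).
fourfold_levels = sorted(k for k, v in levels.items() if len(v) == 1)
print(fourfold_levels)
```

The level GA contains both D and E because, in the diagram, it splits once more at a later step.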

As the code continues to evolve it undergoes further symmetry breaking, so that the third base of some codons
comes into use or, in the atomic analogy, some of the fourfold degenerate levels split into two levels, each
twofold degenerate. The levels marked with an arrow continue to encode the same amino acid, whereas
the other levels replace it with a new one. Eventually, in subsequent steps, a few of the twofold degenerate
levels split once more, giving two non-degenerate levels each. This is the case of the codons that encode methionine
(M), tryptophan (W) and (again) the stop signal. The case of isoleucine (I) is a particular one, since the
split level coincides with the twofold one that represents the two codons that continue to encode the same
amino acid. Thus isoleucine is the only amino acid coded by three codons. The stop signal is
also threefold degenerate, since it is coded by two groups of codons, one twofold degenerate and the other
non-degenerate. At this step of evolution the code froze into its present form. It is worth mentioning
that, as a particular result of the code evolution, the amino acids serine (S), arginine (R) and leucine
(L) are at present coded by two groups of codons each. In the three cases one of the groups is fourfold
degenerate and the other is twofold degenerate, so that these are the only three amino acids that are
sixfold degenerate. We mark this property in the diagram with a broken line linking the two groups of
codons. The two groups of codons that encode the stop signal are also linked by a broken line.


3. Mathematical representation of the genetic code 


We now proceed to describe the genetic code by using quaternions. Define the sets:

$$\mathcal{B} = \{C, G, U, A\}, \tag{1}$$

$$\mathcal{A} = \{P, A, S, T, R, G, C, W, L, V, F, I, M, H, Q, D, E, Y, N, K, \mathrm{Stop}\} \tag{2}$$

and

$$\mathbb{H}_{7,\mathrm{red}}(\mathbb{Z}) = \{(2,1,1,1),\ (2,-1,1,1),\ (2,1,-1,1),\ (2,1,1,-1)\}. \tag{3}$$

We propose a quaternionic representation of the genetic code according to the following scheme:

$$\begin{array}{ccc}
\mathcal{B}^3 & \longrightarrow & \mathcal{A} \\
\downarrow & & \downarrow \\
\mathbb{H}^3_{7,\mathrm{red}}(\mathbb{Z}) & \longrightarrow & \mathbb{H}(\mathbb{Z})
\end{array} \tag{4}$$

where $\mathbb{H}(\mathbb{Z})$ denotes the set of integer quaternions (see Appendix B). $\mathcal{B}^3$ is the set of the 64 codons, and we
assume that the correspondence $\mathcal{B}^3 \to \mathcal{A}$ is the present-day standard genetic code as described by Figure 1,
whereas the function $\mathcal{B}^3 \to \mathbb{H}^3_{7,\mathrm{red}}(\mathbb{Z})$ assigns to each codon a triplet of quaternions of the set $\mathbb{H}_{7,\mathrm{red}}(\mathbb{Z})$
(the subindex red stands for reduced). This set is a maximum cardinality subset of

$$\mathbb{H}_7(\mathbb{Z}) = \{(a_0, a_1, a_2, a_3) : a_0, a_1, a_2, a_3 \in \mathbb{Z};\ a_0^2 + a_1^2 + a_2^2 + a_3^2 = 7,\ a_0 > 0 \text{ and even}\}$$

with the property that it does not contain pairs of conjugate quaternions. The set $\mathbb{H}_7(\mathbb{Z})$ has 7 + 1 = 8
elements [21], and so $\mathbb{H}_{7,\mathrm{red}}(\mathbb{Z})$ has 4 quaternions, as it should. It is worth noting that all the integer
quaternions in $\mathbb{H}_7(\mathbb{Z})$ are prime quaternions, in the sense that they cannot be expressed as the product of
two integer quaternions neither of which is the unit quaternion (1, 0, 0, 0). This is consistent with the
fact that an integer quaternion is prime if and only if its norm is a prime number [21]. Note that taking the
nucleotide bases as prime quaternions gives them a certain character of elemental molecules. Apart from
this, the choice of $\mathbb{H}_{7,\mathrm{red}}(\mathbb{Z})$ may seem rather arbitrary. However, we are just looking for a quaternionic
representation of the genetic code, so whatever set of quaternions we assign to the codons,
the important issue is that the function $\mathbb{H}^3_{7,\mathrm{red}}(\mathbb{Z}) \to \mathbb{H}(\mathbb{Z})$ preserves the essential properties of the
correspondence $\mathcal{B}^3 \to \mathcal{A}$.
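The statements about the set of norm-7 quaternions can be verified by brute force. The sketch below is illustrative only (the helper names `conjugate`, `norm` and `conjugate_free` are ours): it enumerates the integer quaternions with norm 7 and positive even real part, and checks that a subset without conjugate pairs cannot have more than four elements.

```python
from itertools import combinations, product


def conjugate(q):
    """Quaternion conjugate: flip the sign of the vector part."""
    a0, a1, a2, a3 = q
    return (a0, -a1, -a2, -a3)


def norm(q):
    """Norm as used in the text: sum of the squared components."""
    return sum(c * c for c in q)


# Components of a norm-7 integer quaternion lie in [-2, 2].
H7 = [q for q in product(range(-2, 3), repeat=4)
      if norm(q) == 7 and q[0] > 0 and q[0] % 2 == 0]


def conjugate_free(subset):
    """True if the subset contains no pair of conjugate quaternions."""
    members = set(subset)
    return all(conjugate(q) not in members for q in members)


H7_RED = [(2, 1, 1, 1), (2, -1, 1, 1), (2, 1, -1, 1), (2, 1, 1, -1)]

# Largest size of a conjugate-free subset, by exhaustive search (2^8 subsets).
largest = max(len(s) for r in range(len(H7) + 1)
              for s in combinations(H7, r) if conjugate_free(s))

print(len(H7), conjugate_free(H7_RED), largest)
```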

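In what follows, in order to simplify the notation, we assign natural numbers to identify the bases and
the amino acids: $C \to 1$, $G \to 2$, $U \to 3$, $A \to 4$ and $P \to 1$, $A \to 2$, $S \to 3$, $T \to 4$, $R \to 5$, $G \to 6$, $C \to 7$,
$W \to 8$, $L \to 9$, $V \to 10$, $F \to 11$, $I \to 12$, $M \to 13$, $H \to 14$, $Q \to 15$, $D \to 16$, $E \to 17$, $Y \to 18$, $N \to 19$, $K \to 20$,
$\mathrm{Stop} \to 21$.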

Inspired by the diagram of Figure 2 we define the quaternionic function

$$\mathbb{H}^3_{7,\mathrm{red}}(\mathbb{Z}) \to \mathbb{H}(\mathbb{Z})$$
From these expressions we can appreciate the importance of working with objects that obey a
non-commutative algebra. In fact, if the quaternion product were commutative, then amino acids A and R
would be associated with the same quaternion, and the same would occur with S and L.
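A small numerical check illustrates this point. The sketch assumes, consistently with the norms quoted later in the text, that a fourfold degenerate level with first base i and second base j is represented by the ordered product q_i q_j; the names `qmul`, `Q` and `level` are illustrative.

```python
def qmul(a, b):
    """Hamilton product of two quaternions given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
            a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
            a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
            a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0)


# Quaternions assigned to the four bases (the reduced norm-7 set).
Q = {"C": (2, 1, 1, 1), "G": (2, -1, 1, 1), "U": (2, 1, -1, 1), "A": (2, 1, 1, -1)}


def level(first, second):
    """Assumed quaternion of a fourfold level: ordered product of the two bases."""
    return qmul(Q[first], Q[second])


ala, arg = level("G", "C"), level("C", "G")   # alanine GCN, arginine CGN
ser, leu = level("U", "C"), level("C", "U")   # serine UCN, leucine CUN

print(ala, arg)
print(ser, leu)
```

The two products in each pair share the same norm and real part but differ in their vector parts, so the order of the bases in the codon is what distinguishes the amino acids.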

In the definition of this function, the quaternion $\gamma_{i;jk}$ accounts for the level splitting when the second base of the codon is $i$ and
the third base is $jk = 13$ (C, U) or $24$ (G, A). Analogously, the quaternion $\delta_{i;j}$ accounts for the level splitting
when the second base of the codon is $i$ and the third base is $j = 2$ (G) or $4$ (A). Thus, in principle we have
as unknown quaternions $\gamma_{2;13}$, $\gamma_{2;24}$, $\gamma_{3;13}$, $\gamma_{3;24}$, $\gamma_{4;13}$, $\gamma_{4;24}$ and $\delta_{2;2}$, $\delta_{2;4}$, $\delta_{3;2}$ and $\delta_{3;4}$.

Of the 10 unknown quaternions we can find 5, namely $\gamma_{2;13}$, $\gamma_{2;24}$, $\gamma_{3;13}$, $\gamma_{3;24}$ and $\gamma_{4;24}$, by requiring that
those amino acids which are coded by two different groups of codons (the sixfold degenerate amino acids
and the stop signal) be associated with a unique quaternion, and also that the two ways to
reach isoleucine (I) give the same quaternion (see Figure 2), so we must solve the system
The solution is:

$$\begin{aligned}
\gamma_{2;13} &= q_3 q_1 - q_4 q_2 \\
\gamma_{2;24} &= q_1 q_2 - q_4 q_2 \\
\gamma_{3;13} &= q_1 q_3 - q_3 q_3 + \delta_{3;4} \\
\gamma_{3;24} &= q_1 q_3 - q_3 q_3 \\
\gamma_{4;24} &= q_3 q_2 + q_1 q_2 - q_4 q_2 - q_3 q_4 + \delta_{2;4}.
\end{aligned} \tag{8}$$

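To obtain the quaternions $\delta_{2;2}$, $\delta_{2;4}$, $\delta_{3;2}$ and $\delta_{3;4}$ we assign to those levels that cannot split further (non-degenerate
levels) the product of the quaternions associated with each of the corresponding bases: $\alpha_8 = q_3 q_2 q_2$;
$\alpha_{13} = q_4 q_3 q_2$; $\alpha_{21} = q_3 q_2 q_4$; $\alpha_{12} = q_4 q_3 q_4$. In this way we have

$$\begin{aligned}
\delta_{2;2} &= q_3 q_2 q_2 - q_3 q_2 - \gamma_{2;24} \\
\delta_{3;2} &= q_4 q_3 q_2 - q_4 q_3 - \gamma_{3;24} \\
\delta_{2;4} &= q_3 q_2 q_4 - q_3 q_2 - \gamma_{2;24} \\
\delta_{3;4} &= q_4 q_3 q_4 - q_4 q_3 - \gamma_{3;24}.
\end{aligned} \tag{9}$$

Finally, for the remaining unknown quaternion $\gamma_{4;13}$ we propose:

$$\gamma_{4;13} = -\gamma_{4;24}. \tag{10}$$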


Together with the definition of the codon function, Eqs. (8), (9) and (10) completely solve the problem of assigning
quaternions to the amino acids in such a way that the pattern of redundancy of the genetic code is reproduced.
Taking $q_1 = (2,1,1,1)$, $q_2 = (2,-1,1,1)$, $q_3 = (2,1,-1,1)$ and $q_4 = (2,1,1,-1)$, we explicitly obtain

(11)

We will denote by $\mathbb{H}_{\alpha}(\mathbb{Z})$ the set of quaternions assigned to the amino acids as given by Eq. (11).
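The assignment can be recomputed numerically. The sketch below is a reconstruction under a stated assumption, not the authors' code: it assumes that a codon with bases (i, j, k) is sent to the product q_i q_j plus the splitting quaternions γ and δ that apply to its level, which is the additive form implied by Eqs. (8) and (9). The helper names (`qmul`, `chain`, `codon_quaternion`) and the sets `FOURFOLD` and `SPLIT_TWICE` are illustrative.

```python
from itertools import product


def qmul(a, b):
    """Hamilton product of quaternions given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
            a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
            a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
            a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0)


def qadd(a, b):
    return tuple(x + y for x, y in zip(a, b))


def qsub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def norm(a):
    return sum(x * x for x in a)


Q = {1: (2, 1, 1, 1), 2: (2, -1, 1, 1), 3: (2, 1, -1, 1), 4: (2, 1, 1, -1)}  # C, G, U, A


def chain(*idx):
    """Ordered product q_i q_j ... of base quaternions."""
    out = Q[idx[0]]
    for i in idx[1:]:
        out = qmul(out, Q[i])
    return out


gamma, delta = {}, {}
gamma[2, 13] = qsub(chain(3, 1), chain(4, 2))                          # Eq. (8)
gamma[2, 24] = qsub(chain(1, 2), chain(4, 2))
gamma[3, 24] = qsub(chain(1, 3), chain(3, 3))
delta[2, 2] = qsub(qsub(chain(3, 2, 2), chain(3, 2)), gamma[2, 24])    # Eq. (9)
delta[3, 2] = qsub(qsub(chain(4, 3, 2), chain(4, 3)), gamma[3, 24])
delta[2, 4] = qsub(qsub(chain(3, 2, 4), chain(3, 2)), gamma[2, 24])
delta[3, 4] = qsub(qsub(chain(4, 3, 4), chain(4, 3)), gamma[3, 24])
gamma[3, 13] = qadd(gamma[3, 24], delta[3, 4])                         # Eq. (8)
gamma[4, 24] = qadd(qsub(qsub(qadd(chain(3, 2), chain(1, 2)),
                              chain(4, 2)), chain(3, 4)), delta[2, 4])
gamma[4, 13] = tuple(-x for x in gamma[4, 24])                         # Eq. (10)

# (first, second) base levels that stay fourfold degenerate in Figure 2.
FOURFOLD = {(1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (1, 3), (2, 3)}
# Levels UG and AU, whose (G, A) half splits once more into single codons.
SPLIT_TWICE = {(3, 2), (4, 3)}


def codon_quaternion(i, j, k):
    """Assumed additive codon function for first, second, third base i, j, k."""
    out = chain(i, j)
    if (i, j) in FOURFOLD:
        return out
    pair = 13 if k in (1, 3) else 24
    out = qadd(out, gamma[j, pair])
    if pair == 24 and (i, j) in SPLIT_TWICE:
        out = qadd(out, delta[j, k])
    return out


INDEX = {"C": 1, "G": 2, "U": 3, "A": 4}
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Collect, for every amino acid ('*' = stop), the quaternions of its codons.
table = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    table.setdefault(aa, set()).add(codon_quaternion(*(INDEX[b] for b in codon)))

for aa, qs in sorted(table.items()):
    print(aa, sorted(qs), [norm(x) for x in qs])
```

Under this assumption every amino acid, and the stop signal, receives a single quaternion from all of its codons, and the 21 quaternions are pairwise different.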





At first sight this set of quaternions may seem to say nothing special by itself; however, when we look at it
more carefully we start to discover some patterns of regularity or symmetries. The first thing that we observe
is that the norm of all these quaternions is odd: $N(\alpha_i) = a_0^2 + a_1^2 + a_2^2 + a_3^2 \equiv 1 \pmod 2$ $(i = 1, 2, \dots, 21)$,
and it can roughly be taken as a measure of the information needed to encode the corresponding amino acid,
in the sense that the larger the norm, the larger the necessary information. In fact, taking into account
the multiplicative property of the quaternion norm, we can easily see from the definition of the codon function
that those quaternions associated with amino acids which need just the first and second codon bases to be recognized, namely
$\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6, \alpha_9$ and $\alpha_{10}$, have norm $N(\alpha_i) = N(q_\beta q_\gamma) = N(q_\beta) N(q_\gamma) = 49$, whereas those
which need the three bases to that effect, namely the quaternions $\alpha_{13}$ and $\alpha_8$ corresponding to the amino
acids methionine (M) and tryptophan (W), and also $\alpha_{12}$ associated with the amino acid isoleucine (I) and $\alpha_{21}$
with the stop signal, both coming (in one of two possible ways) from a non-degenerate level (see Figure
2), have $N(\alpha_i) = N(q_\beta q_\gamma q_\delta) = N(q_\beta) N(q_\gamma) N(q_\delta) = 343$. Here we have used the fact
that the norms of the quaternions that represent the nucleotide bases are $N(q_\beta) = 7$ $(\beta = 1, 2, 3, 4)$. If the
information about which amino acids will be added during protein synthesis is encoded in the quaternion
triplets $(q_\beta, q_\gamma, q_\delta)$, then for amino acids determined by quaternions of the type $\alpha_i = q_\beta q_\gamma$ the lack
of information is compensated by the degeneration in the third base, whereas for amino acids specified by
quaternions of the form $\alpha_i = q_\beta q_\gamma q_\delta$ there is no lack of information and redundancy would, in principle,
not be necessary. The amino acids which are twofold degenerate have norms which lie, with just one exception
($\alpha_{17}$), in between these extreme values.

We can also use the norm to divide the set $\mathbb{H}_{\alpha}(\mathbb{Z})$ into classes: the norm of the quaternions corresponding
to fourfold or sixfold degenerate levels satisfies $N(\alpha_i) \equiv 1 \pmod 4$, whereas all the remaining quaternions, namely
$\alpha_8, \alpha_{11}, \alpha_{12}, \alpha_{13}, \alpha_{14}, \alpha_{15}, \alpha_{16}, \alpha_{17}, \alpha_{18}, \alpha_{19}, \alpha_{20}$ and $\alpha_{21}$, which come from levels with lower degeneration, have
a norm that fulfills $N(\alpha_i) \equiv 3 \pmod 4$. The exception is the quaternion corresponding to the amino acid
cysteine, which is coded by two codons but satisfies $N(\alpha_7) = 113 \equiv 1 \pmod 4$. In this respect we can say that
in the euplotid nuclear variant of the genetic code the codon UGA encodes the amino acid C instead of the
stop signal. If we consider this variant, then $\alpha_7$ would play in some sense the role of $\alpha_{21}$ and vice versa, and
the exception would be the stop signal, which could be left out of a discussion that mainly concerns
amino acids. However, since we are actually interested in the standard code, we simply take the quaternion
$\alpha_7$ as the exception to the rule and ignore it for the moment in our discussion. The class of quaternions that
satisfies $N(\alpha_i) \equiv 3 \pmod 4$ can be further split into two groups: one ($\alpha_{15}, \alpha_{16}, \alpha_{18}, \alpha_{19}$) with $N(\alpha_i) \equiv 3 \pmod 8$
and the other ($\alpha_8, \alpha_{11}, \alpha_{12}, \alpha_{13}, \alpha_{14}, \alpha_{17}, \alpha_{20}, \alpha_{21}$) with $N(\alpha_i) \equiv 7 \pmod 8$. Although the actual
meaning of this separation is not clear to us, we suspect that it has to do with symmetries involved in the
translation process at the molecular level. In any case, we think that these simple observations are enough to give
a preliminary idea of the potential usefulness of quaternions for discovering hidden patterns of symmetry
inside the genetic code.
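The norm classes can be reproduced with a short computation. As before, this is a sketch under an assumption rather than the authors' procedure: each twofold degenerate amino acid is taken to be the product of its first two base quaternions plus the corresponding γ of Eqs. (8) and (10), and the non-degenerate levels are the triple products quoted before Eq. (9). The names `lin`, `mul`, `alpha` and `classes` are illustrative.

```python
def qmul(a, b):
    """Hamilton product of quaternions given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
            a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
            a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
            a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0)


def mul(*qs):
    """Ordered product of several quaternions."""
    out = qs[0]
    for x in qs[1:]:
        out = qmul(out, x)
    return out


def lin(*terms):
    """Integer linear combination: lin((1, a), (-1, b)) is a - b."""
    return tuple(sum(s * q[n] for s, q in terms) for n in range(4))


def norm(a):
    return sum(x * x for x in a)


q1, q2, q3, q4 = (2, 1, 1, 1), (2, -1, 1, 1), (2, 1, -1, 1), (2, 1, 1, -1)

# Splitting quaternions from Eqs. (8), (9) and (10).
g2_13 = lin((1, mul(q3, q1)), (-1, mul(q4, q2)))
g2_24 = lin((1, mul(q1, q2)), (-1, mul(q4, q2)))
g3_24 = lin((1, mul(q1, q3)), (-1, mul(q3, q3)))
d2_4 = lin((1, mul(q3, q2, q4)), (-1, mul(q3, q2)), (-1, g2_24))
d3_4 = lin((1, mul(q4, q3, q4)), (-1, mul(q4, q3)), (-1, g3_24))
g3_13 = lin((1, g3_24), (1, d3_4))
g4_24 = lin((1, mul(q3, q2)), (1, mul(q1, q2)), (-1, mul(q4, q2)),
            (-1, mul(q3, q4)), (1, d2_4))
g4_13 = lin((-1, g4_24))

# Amino acid quaternions, indexed 1..21 as in the text (P, A, S, ..., K, Stop).
alpha = {
    1: mul(q1, q1), 2: mul(q2, q1), 3: mul(q3, q1), 4: mul(q4, q1),
    5: mul(q1, q2), 6: mul(q2, q2),
    7: lin((1, mul(q3, q2)), (1, g2_13)),
    8: mul(q3, q2, q2), 9: mul(q1, q3), 10: mul(q2, q3),
    11: lin((1, mul(q3, q3)), (1, g3_13)),
    12: mul(q4, q3, q4), 13: mul(q4, q3, q2),
    14: lin((1, mul(q1, q4)), (1, g4_13)), 15: lin((1, mul(q1, q4)), (1, g4_24)),
    16: lin((1, mul(q2, q4)), (1, g4_13)), 17: lin((1, mul(q2, q4)), (1, g4_24)),
    18: lin((1, mul(q3, q4)), (1, g4_13)),
    19: lin((1, mul(q4, q4)), (1, g4_13)), 20: lin((1, mul(q4, q4)), (1, g4_24)),
    21: mul(q3, q2, q4),
}

norms = {i: norm(a) for i, a in alpha.items()}

# Group the indices by the residue of the norm modulo 8.
classes = {}
for i, n in norms.items():
    classes.setdefault(n % 8, []).append(i)

print(norms)
print(classes)
```

Residue 1 modulo 8 collects the fourfold and sixfold degenerate amino acids together with cysteine, while residues 3 and 7 give the two groups of lower degeneration.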

4. Amino acids as quaternions and the folding of proteins 

As we have seen in the previous Section, our quaternionic representation of the genetic code reproduces its
structure, particularly the code redundancy, and makes some regularity patterns evident. However,
the point that we wish to stress here is the special richness that the close relationship between quaternions
and rotations (see Appendix B) gives to the description. Because of the advantages of using quaternions to
describe spatial rotations, the association of amino acids with quaternions opens new horizons beyond the
representation of the genetic code. In this context, we consider the suitability of this association to account
for the folding of the proteins that the amino acids form.

The primary structure of a protein of $N$ amino acids is a sequence $A_1, A_2, \dots, A_N$ with $A_i \in \mathcal{A}$. The
protein folding problem consists in obtaining from this sequence the spatial coordinates of each of the
atoms of all the amino acids that constitute the protein when it is in the native (or functional) state
(tertiary structure). As such we consider the state of the protein in physiological solution, whose
coordinates can be obtained, after crystallization, by applying, for example, X-ray diffraction methods.
That is the case for most of the proteins whose coordinates are stored in the PDB. In principle we restrict
ourselves to determining the coordinates of just the alpha-carbon atoms of the chain, which is not a severe
restriction since very efficient algorithms are known for going from this trace representation
to the full-atom one [25]. We also take into account that, in our quaternionic representation, the amino acid
sequence is expressed as a sequence of quaternions $p_1, p_2, \dots, p_N$ with $p_i \in \mathbb{H}_{\alpha}(\mathbb{Z})$. Under these conditions
we now present an algorithm to determine the spatial coordinates of the alpha-carbon atoms of
the protein.

Figure 3: Histogram of the distance $d_{C_\alpha - C_\alpha}$ between adjacent alpha-carbon atoms. The distances were calculated from
the alpha-carbon coordinates of a sample of 110 proteins of different lengths stored in the PDB (31332
pairs of adjacent alpha-carbon atoms). The mean value and the standard deviation are $\langle d_{C_\alpha - C_\alpha} \rangle \simeq 3.801$ Å and
$\sigma_{C_\alpha - C_\alpha} \simeq [\langle d^2_{C_\alpha - C_\alpha} \rangle - \langle d_{C_\alpha - C_\alpha} \rangle^2]^{1/2} = 0.061$ Å, respectively.

First we observe that, although adjacent alpha-carbon atoms are not covalently bonded, their distance
is notably stable and takes very similar values for all the pairs within a given protein and also for pairs
belonging to different proteins, as the histogram of Figure 3 shows. So in our calculations we assume that
all these distances are equal to a single value $d_{C_\alpha - C_\alpha} = 3.80$ Å. We then determine, on the unit sphere
centered at the origin, a point for each of the amino acids (alpha-carbon atoms) in the protein sequence. The
last one is assigned directly to the origin, and the preceding one is located at the intersection between the $z$ axis and
the sphere surface (versor $\hat{e}_z$). To each of the remaining alpha-carbon atoms we assign the point on the sphere
surface that results from rotating the versor $\hat{e}_z$ by a quaternion (see Appendix B). For the $j$th alpha-carbon
atom in the sequence, the quaternion responsible for the rotation is denoted $\beta_j$ $(j = 1, 2, \dots, N-2)$. We
then expand the chain of alpha-carbon atoms from their locations on the sphere into the three-dimensional
configuration of the protein backbone (see Figure 4) by means of the following iterative procedure, where initially
the $\vec{r}_j$'s are on the sphere surface:

    do i = 1, N-2
        δr = r_{i+1}
        do j = 1, i
            r_j = r_j + δr
        end do
    end do

According to the algorithm, the distance between adjacent alpha-carbon atoms is unity, so, to establish
the correct distance, we must multiply the final calculated coordinates by $d_{C_\alpha - C_\alpha}$.
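The expansion procedure can be written compactly in code. The sketch below is an illustration, not the authors' implementation: it assumes that "rotating the versor by a quaternion" means the usual sandwich product with the normalized quaternion, and the function names `rotate_ez` and `unfold` as well as the random test quaternions are ours.

```python
import math
import random


def qmul(a, b):
    """Hamilton product of quaternions given as (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
            a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
            a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
            a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0)


def rotate_ez(beta):
    """Rotate the versor e_z with beta / |beta| (assumed sandwich product)."""
    n = math.sqrt(sum(x * x for x in beta))
    b = tuple(x / n for x in beta)
    b_conj = (b[0], -b[1], -b[2], -b[3])
    _, x, y, z = qmul(qmul(b, (0.0, 0.0, 0.0, 1.0)), b_conj)
    return (x, y, z)


def unfold(betas, d=3.80):
    """Expand the sphere points into a chain of N = len(betas) + 2 positions.

    The last atom sits at the origin and the one before it at e_z; the
    result is scaled by the assumed alpha-carbon distance d (in angstroms).
    """
    n = len(betas) + 2
    r = [rotate_ez(b) for b in betas] + [(0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
    for i in range(n - 2):            # i = 1 .. N-2 in the text (0-based here)
        dr = r[i + 1]
        for j in range(i + 1):        # j = 1 .. i in the text
            r[j] = tuple(a + b for a, b in zip(r[j], dr))
    return [tuple(d * c for c in p) for p in r]


random.seed(0)
# Arbitrary rotation quaternions standing in for the beta_j of a 10-residue chain.
example_betas = [tuple(random.uniform(-1.0, 1.0) for _ in range(4)) for _ in range(8)]
chain_xyz = unfold(example_betas)
for point in chain_xyz:
    print(tuple(round(c, 3) for c in point))
```

Because each step translates the already placed atoms by the next unit vector, every pair of adjacent atoms ends up exactly one unit apart before the final scaling.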

Figure 4: Development of the alpha-carbon backbone of a hypothetical protein of length $N$ from its position on the
sphere surface into its spatial configuration (schematic). The last two alpha-carbon atoms, as well as some of the first ones,
are labelled by their order number in the sequence.

It remains to determine how to calculate the quaternions $\beta_j$ $(j = 1, 2, \dots, N-2)$. We do this in a
somewhat heuristic way. We take into account that the $j$th amino acid interacts in some way with the
$j-1$ previous amino acids in the sequence and also with the $N-j$ subsequent ones. Of course, the effect
of the medium should be incorporated into these interactions in some form, for example as
effective interactions between amino acids. Actually we are attempting a sort of decoding, so we are
not directly interested in the detailed form of the interactions, but we recognize that any encoding
of information that involves those interactions should carry some trace of their general form. In general it is
reasonable to think that the global interaction includes two-body, three-body, ..., up to $N$-body (effective)
interactions, so by analogy we choose with generality for $\beta_j$ the normalized version of the quaternion


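$$\beta_j = \left( \sum_{k=1}^{j-1} S^{<}_{j,k} \right) p_j + p_j \left( \sum_{k=1}^{N-j} S^{>}_{j,k} \right) \tag{12}$$

with 

$$S^{<}_{j,1} = \sum_{1 \le r \le j-1} c_r p_r, \quad S^{<}_{j,2} = \sum_{1 \le r < s \le j-1} c_r p_r c_s p_s, \quad \ldots, \quad S^{<}_{j,j-1} = c_1 p_1 c_2 p_2 \cdots c_{j-1} p_{j-1} \tag{13}$$

and 

$$S^{>}_{j,1} = \sum_{j+1 \le r \le N} c_r p_r, \quad S^{>}_{j,2} = \sum_{j+1 \le r < s \le N} c_r p_r c_s p_s, \quad \ldots, \quad S^{>}_{j,N-j} = c_{j+1} p_{j+1} c_{j+2} p_{j+2} \cdots c_N p_N, \tag{14}$$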


where $c_r \in \mathbb{H}(\mathbb{R})$ ($r = 1, 2, \ldots, N$) are, in principle, unknown real quaternions to be determined. It is worth 
mentioning that in choosing the form of Eq. (12) we have also taken into account the non-commutativity 
of quaternions. 

Figure 5: Trace representation of the alpha-carbon atoms backbone for the small proteins 2BFI and 1GCN. Red (dark grey) 
tube: from the coordinates obtained using our procedure. Cyan (light grey) ribbon: from the coordinates stored at PDB. 

Figure 6: Trace representation of the alpha-carbon atoms backbone for the protein 2CK5. Red (dark grey) tube: from the 
coordinates obtained using our procedure. Cyan (light grey) ribbon: from the coordinates stored at PDB. 

Even for proteins of relatively small length $N$, the memory and computation time required to evaluate 
the unknown quaternions $c_1, c_2, \ldots, c_N$ using the complete expression given by Eq. (12) for the $\beta_j$'s are too 
large, at least for our computational facilities. Thus in the calculations here we use the simplest version: 

$$\beta_j = S^{<}_{j,1}\, p_j + p_j\, S^{>}_{j,1},$$

which, in our analogy, corresponds to considering just pair interactions in the total potential energy of the protein. 
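
A minimal Python sketch of this pair-interaction form may help to fix the indices. It assumes that `p` is the list of type quaternions and `c` the list of order quaternions, both stored as 4-tuples in sequence order, and that "normalized" means division by the square root of the norm defined in Appendix B. The helper names `qmul`, `qadd` and `beta_pair` are hypothetical.

```python
import math


def qmul(q, p):
    """Hamilton product of two quaternions given as 4-tuples."""
    a0, a1, a2, a3 = q
    b0, b1, b2, b3 = p
    return (a0*b0 - a1*b1 - a2*b2 - a3*b3,
            a0*b1 + a1*b0 + a2*b3 - a3*b2,
            a0*b2 - a1*b3 + a2*b0 + a3*b1,
            a0*b3 + a1*b2 - a2*b1 + a3*b0)


def qadd(q, p):
    """Component-wise sum of two quaternions."""
    return tuple(x + y for x, y in zip(q, p))


def beta_pair(j, c, p):
    """Unit quaternion proportional to S<_{j,1} p_j + p_j S>_{j,1} (1-based j)."""
    zero = (0.0, 0.0, 0.0, 0.0)
    s_less, s_greater = zero, zero
    for r in range(1, j):                     # residues preceding j
        s_less = qadd(s_less, qmul(c[r - 1], p[r - 1]))
    for r in range(j + 1, len(p) + 1):        # residues following j
        s_greater = qadd(s_greater, qmul(c[r - 1], p[r - 1]))
    # Previous residues act from the left, subsequent ones from the right.
    b = qadd(qmul(s_less, p[j - 1]), qmul(p[j - 1], s_greater))
    length = math.sqrt(sum(x * x for x in b))  # assumed nonzero in this sketch
    return tuple(x / length for x in b)


# Toy data: three residues with type quaternions i, j, k and trivial order quaternions.
p = [(0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
c = [(1, 0, 0, 0)] * 3
b2 = beta_pair(2, c, p)   # proportional to i*j + j*k = k + i
```

The cost of this form grows only linearly with the number of residues for each $\beta_j$, in contrast with the full expansion, whose number of terms grows combinatorially.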

Here we adjust the unknown quaternions by means of an optimization technique. Specifically, we use the 
particle swarm optimization (PSO) procedure of Kennedy and Eberhart [26], taking as fitness function the 
difference between the coordinates of the alpha-carbon atoms calculated following the previous procedure and 
the corresponding experimental ones as read from the PDB. We take the rmsd (root-mean-square deviation) 
as a measure of this difference, using for this purpose Bosco K. Ho's implementation of the Kabsch algorithm [27]. 
In this way we assign two quaternions to each amino acid in the primary structure of the protein: an integer 
quaternion, i.e. an element of $\mathbb{H}(\mathbb{Z})$ (type quaternion), and a real one (order quaternion) according to its 
position inside the protein chain. 
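
To illustrate the kind of optimization loop involved, the following is a minimal PSO sketch in Python. It is not the authors' code: it uses a common inertia-weight variant of PSO, the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) are assumptions, and the fitness function is a toy stand-in, whereas in the text the fitness is the rmsd after Kabsch superposition and the unknowns are the components of the order quaternions.

```python
import random


def pso(fitness, dim, n_particles=30, n_iter=200, bounds=(-1.0, 1.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over R^dim with a basic inertia-weight PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [x[:] for x in pos]                 # personal best positions
    best_val = [fitness(x) for x in pos]           # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]     # global best so far
    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(dim):
                # Inertia + pull toward personal best + pull toward global best.
                vel[k][d] = (w * vel[k][d]
                             + c1 * rng.random() * (best_pos[k][d] - pos[k][d])
                             + c2 * rng.random() * (g_pos[d] - pos[k][d]))
                pos[k][d] += vel[k][d]
            val = fitness(pos[k])
            if val < best_val[k]:
                best_pos[k], best_val[k] = pos[k][:], val
                if val < g_val:
                    g_pos, g_val = pos[k][:], val
    return g_pos, g_val


# Toy fitness: squared distance to a known 4-component target.
target = (0.3, -0.2, 0.5, 0.1)


def toy_fitness(x):
    return sum((a - b) ** 2 for a, b in zip(x, target))


best, value = pso(toy_fitness, dim=4)
```

For a protein of length $N$ the search space would have $4N$ real dimensions (four components per order quaternion), which is one reason why a derivative-free population method is a natural choice here.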

In Figures 5 to 8 we show the result of applying our procedure to five small peptides and proteins: 
in Figure 5, the synthetic amyloid fibril peptide (PDB ID: 2BFI, length: 12 amino acids) together with the 
hormone glucagon (PDB ID: 1GCN, length: 29 amino acids); in Figure 6, the ion channel inhibitor OSK1 
toxin (PDB ID: 2CK5, length: 31 amino acids); in Figure 7, a type III antifreeze protein (PDB ID: 1HG7, 
length: 66 amino acids); and in Figure 8, the "hydrogen atom" of proteins, namely myoglobin (PDB ID: 1MBN, 
length: 153 amino acids). The two proteins of Figure 5 were adjusted simultaneously, so that the order 
quaternions for 2BFI are the same as the first 12 of 1GCN. In contrast, the 29 order quaternions we have obtained 
for 1GCN differ from the first 29 of the remaining proteins. With respect to this last fact we 
must mention that, at least within the error (rmsd) considered here, the set of quaternions we found for 
a given protein by fitting the alpha-carbon atoms coordinates is not unique. This is an important point, 
since otherwise the possibility of finding a common set of order quaternions valid for all proteins would 
be definitively closed. In the figures we compare the chains of alpha-carbon atoms calculated with our 
algorithm with the corresponding ones obtained from the coordinates stored at the PDB. The resulting rmsd's 
are: 0.06 Å for 2BFI, 0.26 Å for 1GCN, 0.14 Å for 2CK5, 0.29 Å for 1HG7 and 0.79 Å for 1MBN. 



Figure 7: Trace representation of the alpha-carbon atoms backbone for the protein 1HG7. Red (dark grey) tube: from the 
coordinates obtained using our procedure. Cyan (light grey) ribbon: from the coordinates stored at PDB. 



Figure 8: Trace representation of the alpha-carbon atoms backbone for the protein 1MBN. Red (dark grey) tube: from the 
coordinates obtained using our procedure. Cyan (light grey) ribbon: from the coordinates stored at PDB. 
For 2BFI, 1GCN and 1HG7 we have reconstructed the full-atom protein models from their alpha-carbon 
atom representations using the algorithm of Rotkiewicz and Skolnick (PULCHRA) [25]. The results are shown in 
Figures 9, 10 and 11, where we also display the corresponding proteins as obtained from the PDB coordinates. 



Figure 9: Full atom line representation of the peptide 2BFI. Red (dark grey): reconstruction from the alpha-carbon atoms 
backbone coordinates (obtained with our procedure) using the method of Ref. [25]. Cyan (light grey): from the coordinates 
stored at PDB. In the rebuilt protein the hydrogen atoms do not appear. 



Figure 10: Full atom line representation of the protein 1GCN. Red (dark grey): reconstruction from the alpha-carbon atoms 
backbone coordinates (obtained with our procedure) using the method of Ref. [25]. Cyan (light grey): from the coordinates 
stored at PDB. In the rebuilt protein the hydrogen atoms do not appear. 

It must be stressed again that in this work we have simply shown a way to pass from the primary to the 
tertiary structure of proteins, assuming the corresponding order quaternions to be known. These quaternions 
were obtained by fitting the coordinates of the alpha-carbon atoms calculated with our algorithm to 
the corresponding ones stored at the PDB. The problem of using the procedure described here to predict 
the tertiary structure of proteins from their amino acid sequences alone, which requires knowing a priori a 
unique set of order quaternions adequate for all proteins, is left for future studies. Despite this 
important open question, we believe that the results obtained so far already give a good idea of 
the usefulness of associating amino acids with quaternions, which is the main objective of this Section. 

Figure 11: Full atom line representation of the protein 1HG7. Red (dark grey): reconstruction from the alpha-carbon atoms 
backbone coordinates (obtained with our procedure) using the method of Ref. [25]. Cyan (light grey): from the coordinates 
stored at PDB. In the rebuilt protein the hydrogen atoms do not appear. 


5. Conclusions 

In this work we have presented a mathematical representation of the standard genetic code. Starting 
from a set of four prime integer quaternions (one for each of the nucleotide bases that form the mRNA 
molecules) and guided by a heuristic diagram that we propose for the evolution of the code, we introduce a 
function that assigns an integer quaternion (type quaternion) to each codon (represented by a triplet of the 
prime integer quaternions) and preserves the main properties of the genetic code. The diagram we introduce 
to describe the evolution of the genetic code is based on pioneering ideas by Crick and incorporates, in 
a way that resembles the energy levels of an atom, the physical notion of broken symmetry. The objects 
that we use for the mathematical representation of the code, the Hamilton quaternions, have two 
remarkable properties: they obey a non-commutative algebra and they can describe 
spatial rotations. The latter property in particular gives a special character to the representation, in the sense 
that it allows us to develop a procedure for going from the primary to the tertiary structure of proteins. 
To this end we introduce a set of real quaternions (order quaternions) that, together with the integer 
type quaternions, uniquely identify each amino acid of a protein. Given an amino acid sequence, we 
present an algorithm that determines the coordinates of the alpha-carbon atoms of the corresponding protein 
using the type and order quaternions. Here, however, we simply adjust the order quaternions so as to 
reproduce the experimental coordinates stored at the PDB. As already mentioned above, we postpone for 
future studies the search for a set of order quaternions common to all proteins, 
that is, the possibility of approaching the protein folding problem with our procedure. In our view, 
this possibility distinguishes the above quaternionic representation of the genetic code from the various 
mathematical representations reported so far. 

Appendix A: One- and three-letter conventions for the 20 standard amino acids 


Amino acid        Three letter    One letter 
alanine           ala             A 
arginine          arg             R 
asparagine        asn             N 
aspartic acid     asp             D 
cysteine          cys             C 
glutamic acid     glu             E 
glutamine         gln             Q 
glycine           gly             G 
histidine         his             H 
isoleucine        ile             I 
leucine           leu             L 
lysine            lys             K 
methionine        met             M 
phenylalanine     phe             F 
proline           pro             P 
serine            ser             S 
threonine         thr             T 
tryptophan        trp             W 
tyrosine          tyr             Y 
valine            val             V 

Appendix B: Hamilton quaternions 


Quaternions were invented by the mathematician William Rowan Hamilton [18, 19] in 1843 as a generalization 
of the complex numbers, with the aim of describing rotations in space in the same sense as complex 
numbers describe rotations in the plane. Here we give, for completeness, some of the main properties of 
quaternions. We concentrate on their definition, the algebra they fulfill and their relation with 
rotations in space [28]. 

Definition 

A quaternion $q$ is an ordered list of four numbers: $q = (a_0, a_1, a_2, a_3)$ with $a_0, a_1, a_2, a_3 \in \mathbb{R}$. In the 
particular case in which the four numbers are integers we speak of integer quaternions (Lipschitz quaternions). 
Alternatively, we can introduce the placeholders $i$, $j$, $k$ and represent the same quaternion as 
$q = a_0 + a_1 i + a_2 j + a_3 k$. The placeholders $i$, $j$, $k$ obey the product rules 

$$\begin{array}{lll} ii = -1 & jj = -1 & kk = -1 \\ ij = k & jk = i & ki = j \\ ji = -k & kj = -i & ik = -j \end{array}$$

Note that the placeholders play for quaternions a role in some sense similar to that of the imaginary unit 
$i = \sqrt{-1}$ for the complex numbers. In this context the triplet $(a_1, a_2, a_3)$ would be the "imaginary" part of 
the quaternion. Defining the (real and imaginary) quaternions $q_R = (a_0, 0, 0, 0)$ and $q_I = (0, a_1, a_2, a_3)$, we 
can write: $q = q_R + q_I$. 

Algebra 

Let $s$ be a real number and $q = (a_0, a_1, a_2, a_3)$, $p = (b_0, b_1, b_2, b_3)$ and $r = (c_0, c_1, c_2, c_3)$ quaternions. We 
give here the definition of a few operations: 

- Conjugation: $\bar{q} = (a_0, -a_1, -a_2, -a_3)$ 

- Scalar multiplication: $sq = (sa_0, sa_1, sa_2, sa_3)$ 

- Addition of quaternions: $q + p = (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)$ 

- Multiplication of quaternions: $qp = r$ where 

$$\begin{aligned}
c_0 &= a_0 b_0 - a_1 b_1 - a_2 b_2 - a_3 b_3 \\
c_1 &= a_0 b_1 + a_1 b_0 + a_2 b_3 - a_3 b_2 \\
c_2 &= a_0 b_2 - a_1 b_3 + a_2 b_0 + a_3 b_1 \\
c_3 &= a_0 b_3 + a_1 b_2 - a_2 b_1 + a_3 b_0
\end{aligned}$$

Note that this product is not commutative, that is, in general, $qp \neq pq$. 

- Norm: $N(q) = q\bar{q} = \bar{q}q = a_0^2 + a_1^2 + a_2^2 + a_3^2$ 

A quaternion $q$ with $N(q) = 1$ is called a unit quaternion. 

An important property of the norm is that it is multiplicative: $N(pq) = N(p)\,N(q)$ 

- Inverse: $q^{-1} = \bar{q}/N(q)$ ($q \neq (0, 0, 0, 0)$) 
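
The operations listed above can be checked numerically with a few lines of Python. This is a minimal sketch in which quaternions are plain 4-tuples; the function names `conj`, `qmul`, `norm` and `inverse` are hypothetical and chosen only for illustration.

```python
def conj(q):
    """Conjugate: flip the sign of the imaginary part."""
    a0, a1, a2, a3 = q
    return (a0, -a1, -a2, -a3)


def qmul(q, p):
    """Hamilton product qp, following the component formulas given in the text."""
    a0, a1, a2, a3 = q
    b0, b1, b2, b3 = p
    return (a0*b0 - a1*b1 - a2*b2 - a3*b3,
            a0*b1 + a1*b0 + a2*b3 - a3*b2,
            a0*b2 - a1*b3 + a2*b0 + a3*b1,
            a0*b3 + a1*b2 - a2*b1 + a3*b0)


def norm(q):
    """N(q): sum of the squared components (not its square root)."""
    return sum(x * x for x in q)


def inverse(q):
    """q^{-1} = conj(q) / N(q); q must not be the zero quaternion."""
    n = norm(q)
    return tuple(x / n for x in conj(q))


i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
ij = qmul(i, j)                      # equals k
ji = qmul(j, i)                      # equals -k: the product is not commutative

p, q = (1, 2, 3, 4), (2, -1, 0, 5)   # two integer (Lipschitz) quaternions
lhs = norm(qmul(p, q))               # N(pq)
rhs = norm(p) * norm(q)              # N(p) N(q)
```

With integer quaternions, as in the last lines, the multiplicativity of the norm can be verified exactly, without floating-point rounding.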


Quaternions and 3D rotations 
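If $N(q) = 1$ then the matrix 

$$R_q = \begin{pmatrix}
a_0^2 + a_1^2 + a_2^2 + a_3^2 & 0 & 0 & 0 \\
0 & a_0^2 + a_1^2 - a_2^2 - a_3^2 & 2a_1 a_2 - 2a_0 a_3 & 2a_1 a_3 + 2a_0 a_2 \\
0 & 2a_1 a_2 + 2a_0 a_3 & a_0^2 - a_1^2 + a_2^2 - a_3^2 & 2a_2 a_3 - 2a_0 a_1 \\
0 & 2a_1 a_3 - 2a_0 a_2 & 2a_2 a_3 + 2a_0 a_1 & a_0^2 - a_1^2 - a_2^2 + a_3^2
\end{pmatrix}$$
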
is a rotation matrix. The oriented axis of rotation $\hat{e}$ is given by 

$$\hat{e} = \frac{\vec{q}_I}{|\vec{q}_I|}$$

with $\vec{q}_I = a_1 \hat{e}_x + a_2 \hat{e}_y + a_3 \hat{e}_z$, where $\hat{e}_x$, $\hat{e}_y$ and $\hat{e}_z$ are versors along the three Cartesian axes. The angle $\theta$ 
that determines the rotation around the axis satisfies the following equation: 

$$\tan(\theta/2) = \frac{\sqrt{a_1^2 + a_2^2 + a_3^2}}{a_0}$$

Moreover, if we denote by $R_3$ the $3 \times 3$ matrix that results when the first row and the 
first column of the matrix $R_q$ are deleted, then we can see that the quaternion $q$ transforms by rotation a vector 
$\vec{r}_0 = x_0 \hat{e}_x + y_0 \hat{e}_y + z_0 \hat{e}_z$ into the vector $\vec{r}_1 = x_1 \hat{e}_x + y_1 \hat{e}_y + z_1 \hat{e}_z$ according to 

$$\begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix} = R_3 \begin{pmatrix} x_0 \\ y_0 \\ z_0 \end{pmatrix}$$
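
As a quick numerical check of these formulas, the following Python sketch builds the $3 \times 3$ block $R_3$ from a unit quaternion and applies it to a vector. The function names `rotation_matrix`, `rotate` and `rotation_angle` are hypothetical, and the example quaternion (a quarter turn about the z axis) is chosen only for illustration.

```python
import math


def rotation_matrix(q):
    """3x3 block R_3 of R_q for a unit quaternion q = (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = q
    return [
        [a0*a0 + a1*a1 - a2*a2 - a3*a3, 2*a1*a2 - 2*a0*a3, 2*a1*a3 + 2*a0*a2],
        [2*a1*a2 + 2*a0*a3, a0*a0 - a1*a1 + a2*a2 - a3*a3, 2*a2*a3 - 2*a0*a1],
        [2*a1*a3 - 2*a0*a2, 2*a2*a3 + 2*a0*a1, a0*a0 - a1*a1 - a2*a2 + a3*a3],
    ]


def rotate(q, v):
    """Apply R_3 to the vector v = (x0, y0, z0)."""
    m = rotation_matrix(q)
    return tuple(sum(m[row][col] * v[col] for col in range(3)) for row in range(3))


def rotation_angle(q):
    """Angle theta from tan(theta/2) = |imaginary part| / a0 (atan2 handles a0 = 0)."""
    a0, a1, a2, a3 = q
    return 2.0 * math.atan2(math.sqrt(a1*a1 + a2*a2 + a3*a3), a0)


# Unit quaternion for a rotation of pi/2 about the z axis.
half = math.pi / 4
q = (math.cos(half), 0.0, 0.0, math.sin(half))
x1, y1, z1 = rotate(q, (1.0, 0.0, 0.0))   # e_x is carried onto e_y
theta = rotation_angle(q)                 # pi/2 up to rounding
```

The same `rotate` function, applied to the versor along z with the quaternion $\beta_j$, gives the point on the unit sphere used in the folding procedure of the main text.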